Read from multiple datasets at once

Hello,
I wanted to ask if there is a functionality in h5pyd to retrieve data from multiple datasets/groups in one call.
Context:
I am asking for performance reasons, especially for reading fairly small subsets. My data forces me to split it across different datasets, but I usually need to retrieve data from all of them.

So far I haven’t found anything in the documentation except an old design document:

Has this been implemented?

Also, I have tried to access different domains asynchronously with the Python asyncio library, which gave me the same runtime as accessing them sequentially. Maybe I made a mistake, or does HSDS not support parallel requests from one client?

I appreciate your help. :slight_smile:
Leonard

Hi, thanks for your question!

Yes, if you need to retrieve data from many small datasets it can be a bit slow, since the latency of each request to HSDS adds up.

When you experimented with asyncio, were you using the aiohttp package? Unless your HTTP routines specifically support await, the calls are likely to be made sequentially anyway.

Another practical issue with asyncio is that unless your application is already designed with async processing in mind, it’s hard to bolt on some async functions later on.
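
To illustrate the first point, here's a minimal sketch (not h5pyd internals): because each h5pyd read issues a blocking HTTP request under the hood, wrapping it in a coroutine still stalls the event loop, and asyncio.gather ends up running the reads one after another:

import asyncio

async def read_one(ds, index):
    # ds[index] performs a blocking HTTP request internally, so the event
    # loop stalls here and the "concurrent" tasks actually run in sequence
    return ds[index]

async def read_all(datasets, indices):
    tasks = [read_one(ds, idx) for ds, idx in zip(datasets, indices)]
    return await asyncio.gather(*tasks)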

Anyway, to provide a more practical way for Python users to benefit from parallel processing, we recently added an h5pyd feature to help with this use case: MultiManager. The MultiManager enables applications to read or write multiple selections across multiple datasets in one call. Internally, it uses Python threading to send one HTTP request per selection in parallel.
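
For reference, basic usage looks something like this (a sketch based on the test code; the domain path and dataset names are made up):

import h5pyd

# open a domain and pick the datasets to read from (hypothetical names)
f = h5pyd.File("/home/myuser/weather.h5", "r")
datasets = [f["temperature"], f["pressure"], f["humidity"]]
mm = h5pyd.MultiManager(datasets)

# one selection per dataset (here each dataset is assumed 2-D); each
# selection is fetched via its own HTTP request, sent in parallel
results = mm[[(0, 0), (0, 0), (0, 0)]]  # list of arrays, one per dataset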

The code is not yet in an official h5pyd release, but you can get it with:

$ pip install git+https://github.com/hdfgroup/h5pyd

Take a look at some of the test code from:

and it should be fairly clear how it works. If anything is unclear, please let us know.

Once you’ve tried out MultiManager, I’d be curious to hear what kind of performance benefit you see. In our testing speedup varies quite a bit depending on the number of selections used, the size of the selections, and a host of other factors. Hopefully your application will get a good speedup!


Hi jreadey,
I am happy this feature has already been implemented in h5pyd. With the help of the test code I was able to write a small benchmark script which compares accessing the data with the MultiManager versus sequentially. I tested it on what is effectively a netCDF file with one 5-dimensional variable and the 5 corresponding coordinate axes, so 6 datasets in total, one of them much larger than the others. I benchmarked by retrieving one random entry from each of the datasets in order to avoid caching effects.

Time for sequential access: ~400 ms
Time with MultiManager: ~100 ms

That is a performance improvement of a factor of 4, compared to a theoretical limit of 6 (one parallel request per dataset).

I would assume the large dataset incurs some lookup overhead, so retrieving a value from it takes longer than from the smaller ones. If that is the case, accessing equally sized datasets should result in better scaling for the MultiManager.

I wrote some generic benchmark code, feel free to use it:

import random
from time import time

import h5pyd


def generate_range(ds_shape: tuple):
    # generate a tuple of random indices for one dataset
    indices = []
    for axis_length in ds_shape:
        index = random.randint(0, axis_length - 1)
        indices.append(index)
    return tuple(indices)


def generate_index_query(h5file):
    # generate a list of index tuples
    query = []
    for ds in h5file.values():
        ds_shape = ds.shape
        indices = generate_range(ds_shape)
        query.append(indices)
    return query


def benchmark_multimanager(h5file, num=10):
    """
    Benchmark retrieving one random entry from every dataset in an h5file 
    using the MultiManager.
    """
    ds_names = list(h5file.keys())
    datasets = [h5file[name] for name in ds_names]
    mn = h5pyd.MultiManager(datasets)

    # prepare queries to exclude from runtime
    queries = []
    for i in range(num):
        query = generate_index_query(h5file)
        queries.append(query)

    # accessing the data
    t0 = time()
    for query in queries:
        results = mn[query]

    runtime = time() - t0
    print(f"Mean runtime multimanager: {runtime / num:.4f} s")
    # 100ms for case with 6 datasets


def benchmark_sequential_ds(h5file, num=10):
    """
    Benchmark retrieving one random entry from every dataset in 
    an h5file by sequentially looping through the datasets
    """
    # prepare queries to exclude this code from runtime
    index_lists = []
    for i in range(num):
        index_list = []
        for ds in h5file.values():
            indices = generate_range(ds.shape)
            index_list.append(indices)
        index_lists.append(index_list)

    # accessing the data
    t0 = time()
    for index_list in index_lists:
        for indices, ds in zip(index_list, h5file.values()):
            result = ds[indices]

    runtime = time() - t0
    print(f"Mean runtime sequentially: {runtime / num:.4f} s")
    # ~400 ms for the case with 6 datasets
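
To run the two benchmarks, open a domain and pass it to both functions (the domain path here is just an example):

import h5pyd

with h5pyd.File("/home/leonard/weather.h5", "r") as f:
    benchmark_sequential_ds(f)
    benchmark_multimanager(f)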

Will the MultiManager be added to the next release?


Happy to hear that the MultiManager worked so well for you!
Yes, it will be in the next h5pyd release (I might add your benchmark script as well).

I’ve checked in Leo’s benchmark test to h5pyd here: h5pyd/examples/multi_mgr_benchmark.py at master · HDFGroup/h5pyd · GitHub.

Got the following results on AWS:


$ python multi_mgr_benchmark.py
Mean runtime sequentially: 3.7388 s
Mean runtime multimanager: 0.4490 s

More than an 8x speedup! (YMMV.)

Also added a notebook example here: h5pyd/examples/notebooks/multi_manager_example.ipynb at master · HDFGroup/h5pyd · GitHub


You tested with a local file, right?
If so, I am quite impressed that there was so much to gain even locally. For remote access I would assume it scales even better.

No, this was testing against S3 (and re-starting HSDS to negate any caching effects).
There was significant speedup with local data as well.
